Algorithmic Approaches to the String Barcoding Problem
Authors
Abstract
This thesis presents a heuristic approach, based on Lagrangian relaxation, to the string barcoding (SB) problem, a close relative of the well-known combinatorial set cover (SC) problem. The SB problem has recently been proven NP-hard and has many real-world applications, particularly in medicine and biology. Given a set of sequences over some alphabet (DNA, for instance), the aim is to find a set of short sequences, so-called probes, such that an unknown sample sequence can be identified as one of the input sequences by determining which probes are subsequences of the sample and which are not. The problem is twofold: determining all possible probes, and selecting a suitable subset of minimum cardinality. It has been studied under various other names and was introduced in this form by Rash and Gusfield in 2002. They proposed an exact approach based on integer linear programming, using suffix trees to generate a complete, nonredundant set of candidate probes. We evaluated several approaches for both the SB and the SC problem. One of the leading heuristics for the SC problem, based on Lagrangian relaxation, was proposed by Caprara et al. in 1999. We adapted this algorithm to see whether it works equally well on the structurally very similar SB problem. Although the results we obtained are somewhat mixed, the heuristic shows its strength on very complex instances and delivers much better results than simpler heuristics.
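The two-part structure described in the abstract (generate candidate probes, then select a small distinguishing subset) can be illustrated with a minimal sketch. This is not the thesis's Lagrangian heuristic, nor the integer-programming approach of Rash and Gusfield; it is a plain greedy set-cover baseline. It assumes probes are contiguous substrings, enumerates them by brute force instead of with a suffix tree, and the names `candidate_probes`, `greedy_barcode`, `signature`, and the `max_len` parameter are hypothetical.

```python
from itertools import combinations


def candidate_probes(sequences, max_len):
    """Brute-force stand-in for suffix-tree probe generation (assumption):
    every distinct substring of length 1..max_len found in any sequence."""
    probes = set()
    for s in sequences:
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                probes.add(s[i:j])
    return probes


def greedy_barcode(sequences, max_len=3):
    """Greedy set cover: the items to cover are pairs of sequences, and a
    probe covers a pair if it occurs in exactly one of the two sequences."""
    pairs = set(combinations(range(len(sequences)), 2))
    covers = {}
    for p in candidate_probes(sequences, max_len):
        hits = [p in s for s in sequences]
        covers[p] = {(a, b) for (a, b) in pairs if hits[a] != hits[b]}

    chosen, uncovered = [], set(pairs)
    while uncovered:
        # Iterate probes in sorted order so ties are broken deterministically.
        best = max(sorted(covers), key=lambda p: len(covers[p] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("some sequences cannot be distinguished "
                             "with probes of this length")
        chosen.append(best)
        uncovered -= gained
    return chosen


def signature(sequence, probes):
    """Presence/absence vector (the 'barcode') of a sequence."""
    return tuple(p in sequence for p in probes)
```

For a toy input such as `["ACGT", "ACGA", "TTGA", "ACTT"]`, calling `greedy_barcode` returns a list of probes for which `signature` differs across all four sequences. The greedy rule gives no optimality guarantee, which is the kind of gap the more elaborate heuristics discussed in the abstract aim to narrow.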
Similar resources
Algorithmic Perspectives of the String Barcoding Problems
1.1 INTRODUCTION Let Σ be a finite alphabet. A string is a concatenation of elements of Σ. The length of a string x, denoted by |x|, is the number of characters that constitute it. Let S be a set of strings over Σ. The simplest "binary-valued version" of the string barcoding problem discussed in this chapter is defined as follows [3, 17]: Problem name: String barcoding problem (S...
The String Barcoding Problem
In this paper we consider an approach to solve the string barcoding problem. This approach is based on an explicit reduction from the problem to the satisfiability problem.
Semi-local String Comparison: Algorithmic Techniques and Applications
The longest common subsequence (LCS) problem is a classical problem in computer science. The semi-local LCS problem is a generalisation of the LCS problem, arising naturally in the context of string comparison. Apart from playing an important role in string algorithms, this problem turns out to have surprising connections with computational geometry, algebra, graph theory, as well as applicatio...
Highly Scalable Algorithms for Robust String Barcoding
String barcoding is a recently introduced technique for genomic-based identification of microorganisms. In this paper, we describe the engineering of highly scalable algorithms for robust string barcoding. Our methods enable distinguisher selection based on whole genomic sequences of hundreds of microorganisms of up to bacterial size, on a well-equipped workstation. Experimental results on both...
Fast Kernel Methods for SVM Sequence Classifiers
In this work we study string kernel methods for sequence analysis and focus on the problem of species-level identification based on short DNA fragments known as barcodes. We introduce efficient sorting-based algorithms for exact string k-mer kernels and then describe a divide-and-conquer technique for kernels with mismatches. Our algorithm for the mismatch kernel matrix computation improves cur...
Publication date: 2007